Abstract: A common problem in modern graph analysis is the
detection  of  communities,  an  example  of  which  is  the  detection 
of  a  single  anomalously  dense  subgraph.  Recent  results  have 
demonstrated  a  fundamental  limit  for  this  problem  when  using 
spectral  analysis  of  modularity.  In  this  paper,  we  demonstrate 
the  implication  of  these  results  on  community  detection  when 
a  cue  vertex  is  provided,  indicating  one  of  the  vertices  in 
the  community  of  interest.  Several  recent  algorithms  for  local 
community  detection  are  applied  in  this  context,  and  we  compare 
their  empirical  performance  to  that  of  the  simple  method  used 
to  derive  the  theoretical  spectral  limits. 

I.  Introduction 

In  many  applications,  the  data  of  interest  take  the  form  of 
entities and the relationships between them. These may represent
a broad, diverse set of data types, from communication
between  people  to  interactions  between  proteins.  In  all  of  these 
diverse  contexts,  the  relational  data  are  typically  represented 
as  a  graph. 

One  of  the  common  problems  among  analysts  working  with 
graph-based  data  is  subgraph  detection.  Given  a  large  set  of 
entities  and  their  relationships,  connections,  and  interactions, 
it  can  be  difficult  to  determine  if  there  is  a  particular  subset 
of entities that requires special attention [1], [2]. Typically,
the  objective  is  to  find  a  relatively  small  set  of  vertices  whose 
topology  is  inconsistent  with  some  notion  of  expected  behavior 
in  the  graph.  The  classical  planted  clique  problem  embodies 
this  in  a  simple  form. 

In  the  planted  clique  problem,  the  objective  is  to  locate  a 
subset  where  all  possible  connections  exist,  when  connections 
across  the  rest  of  the  graph  occur  with  a  fixed  probability. 
This  simplified  scenario  has  enabled  the  derivation  of  hard 
detectability  limits  [3],  [4].  While  simplified  for  mathematical 
tractability, this problem yields valuable insight into detectability
in more complicated networks derived from real data.

The  planted  clique  problem  is  traditionally  focused  on 
uncued  detection,  i.e.,  determining  the  nodes  that  comprise 
the  clique  without  any  additional  information  about  which 
entities are interesting. In practice, however, some
additional prior knowledge is typically available. For example, in an
advertising application in a social network, a company may
know that a person uses its product and want to
advertise to other network users who have close relationships
with that customer. Such prior knowledge enables
more efficient use of resources: a search that could otherwise
consider the entire graph can instead prioritize entities
that are close to the cue. Understanding the implications
that  recent  subgraph  detection  bounds  have  on  the  setting 
where  a  cue  is  present  will  improve  our  understanding  of 
detectability  in  this  common  alternative  context. 

In  this  paper,  we  investigate  the  implication  of  recent 
spectral  limits  of  planted  clique  detection  to  cases  where  one 
entity  in  the  clique  is  revealed.  Using  a  simple  method  to 
reduce  the  number  of  entities  considered,  we  can  directly  apply 
current  bounds  for  uncued  detection  to  the  reduced  dataset. 
The  resulting  bounds  show  that,  under  the  right  circumstances, 
it is possible to detect a clique whose size shrinks as
the overall graph grows. We demonstrate empirically
that  current  cued  subgraph  detection  methods  go  through  a 
detectability  phase  transition  at  the  same  point  as  the  simple 
filtering method, suggesting that the analysis applied here has
implications  for  performance  using  several  different  methods. 

The remainder of this paper is organized as follows. Section
II formalizes the problem and defines our notation. In
Section  III,  we  review  the  recent  spectral  bounds  on  uncued 
planted  clique  detection,  and  their  extension  to  planted  dense 
subgraph  detection.  Section  IV  derives  an  extension  of  these 
results  to  cases  where  a  cue  provides  a  simple  up-front  entity 
filtering  method.  In  Section  V,  we  define  a  set  of  experiments 
in which we compare several cued subgraph detection algorithms
from the open literature to the simple filtering method,
and  Section  VI  outlines  the  results  of  these  experiments.  We 
conclude  the  paper  in  Section  VII  with  a  brief  summary  and 
directions  for  future  work. 

II.  Problem  Model 
A.  Definitions  and  Notation 

In the problem we consider, we are given a graph $G = (V, E)$,
which is comprised of a set of vertices $V$ (representing
entities), and a set of edges $E$ (the relationships between the
entities). We denote the number of vertices in the graph by
$N = |V|$. There is inherently a subgraph of interest, whose
vertices are denoted by $V_S \subset V$, and its size is denoted
$k = |V_S|$. The graphs considered in this paper are unweighted
(meaning  connections  either  exist  or  do  not,  with  no  notion 
of  magnitude)  and  undirected  (meaning  all  connections  are 
bidirectional). 

Since  the  bounds  we  derive  are  based  on  spectral  methods, 
we  will  make  use  of  matrix  representations  of  the  graph.  The 
adjacency matrix $A = \{a_{ij}\}$ of the graph $G$ is an $N \times N$ matrix
where $a_{ij}$ is nonzero only if there is an edge in $E$ between
vertices $v_i$ and $v_j$. (This requires an arbitrary labeling of the
vertices with integers from 1 to $N$.) Since $G$ is unweighted, $A$
will be binary, and since $G$ is undirected, $A$ will be symmetric.
The degree of $v_i$ (the number of edges connected to it) is
denoted by $d_i$.

Other  matrix  representations  of  graphs  have  been  used  in 
the  community  detection  literature.  The  graph  Laplacian  has 
been  used  to  approximate  the  solution  to  the  min-cut  problem, 
where  the  objective  is  to  make  the  graph  disconnected  by 
removing  the  smallest  number  of  edges.  The  Laplacian  is 
defined  as 

$$L := D - A, \qquad (1)$$

where $D$ is a diagonal matrix where the entry in row $i$ and
column $i$ is $d_i$. When there is some notion of the probability
of connections, the modularity matrix has also been used for
community detection [5]. This matrix is used to optimize the
partition  of  a  graph  according  to  a  different  criterion:  creating 
a  partition  where  there  are  a  greater-than-expected  number  of 
edges  on  either  side  of  the  partition,  and  fewer  edges  than 
expected  crossing  it.  The  modularity  matrix  is  defined  as 

$$B := A - \mathbb{E}[A]. \qquad (2)$$

Thus,  B  represents  the  residuals  obtained  when  subtracting  the 
expected  adjacency  matrix  from  the  observed.  In  the  traditional 
planted clique problem, the background graph is an Erdős-Rényi
random graph, i.e., a graph where each pair of vertices
shares  an  edge  with  equal  probability  p. 
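
The matrix definitions above can be illustrated with a small sketch. It assumes a dense list-of-lists representation, no self-loops, and an Erdős-Rényi null model in which every off-diagonal entry of $\mathbb{E}[A]$ equals $p$; the function names are hypothetical and chosen only for this illustration.

```python
import random


def planted_clique_graph(n, p, clique, seed=0):
    """Adjacency matrix of an Erdos-Renyi G(n, p) graph with a clique on `clique`."""
    rng = random.Random(seed)
    members = set(clique)
    a = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Clique pairs are always connected; all other pairs with probability p.
            if (i in members and j in members) or rng.random() < p:
                a[i][j] = a[j][i] = 1
    return a


def laplacian(a):
    """Graph Laplacian L = D - A, as in equation (1)."""
    n = len(a)
    deg = [sum(row) for row in a]
    return [[(deg[i] if i == j else 0) - a[i][j] for j in range(n)]
            for i in range(n)]


def modularity(a, p):
    """Modularity matrix B = A - E[A], as in equation (2).

    Assumption: E[a_ij] = p for i != j and 0 on the diagonal.
    """
    n = len(a)
    return [[a[i][j] - (p if i != j else 0.0) for j in range(n)]
            for i in range(n)]
```

For example, `modularity(planted_clique_graph(30, 0.2, range(5)), 0.2)` has entries of $1 - p = 0.8$ between clique vertices, which is the rank-1 structure the spectral analysis relies on.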

B.  Cued  Subgraph  Detection 

In  the  cued  subgraph  detection  problem,  we  observe  the 
graph $G$ and are given a cue vertex $v_c \in V_S$. Our objective
is to determine the remainder of $V_S$. This is typically done
by computing a test statistic $z(v)$ for each $v \in V \setminus \{v_c\}$, and
estimating the subgraph of interest to be

$$\hat{V}_S = \{v_c\} \cup \{v \in V \setminus \{v_c\} : z(v) > t\},$$

where $t$ is a threshold that can be varied. Each of the algorithms
we consider in Section V follows this format. In this
paper,  we  evaluate  performance  based  on  receiver  operating 
characteristic (ROC) metrics. Here, the empirical probability of
detection is

$$P_d = \frac{|\hat{V}_S \cap V_S| - 1}{|V_S| - 1}$$

(where 1 is subtracted in the numerator and denominator since
we do not account for the cue vertex in the evaluation), the
empirical false alarm rate is

$$P_{fa} = \frac{|\hat{V}_S \setminus V_S|}{|V \setminus V_S|},$$

and  overall  performance  of  a  detection  algorithm  is  evaluated 
based  on  the  area  under  the  ROC  curve  (AUC). 
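
As a rough illustration of these metrics, the sketch below computes the empirical detection and false alarm rates at one threshold, and the AUC through the equivalent rank (Mann-Whitney) formulation. It assumes `scores` maps every vertex except the cue to its statistic $z(v)$; the function names and the dictionary layout are assumptions made for this example.

```python
def detection_rates(scores, truth, cue, t):
    """Empirical (P_d, P_fa) at threshold t.

    scores: dict mapping each vertex other than the cue to z(v).
    truth:  set of true subgraph vertices V_S, including the cue.
    """
    estimate = {cue} | {v for v, z in scores.items() if z > t}
    n_total = len(scores) + 1  # all scored vertices plus the cue
    # The cue is excluded from both numerator and denominator of P_d.
    p_d = (len(estimate & truth) - 1) / (len(truth) - 1)
    p_fa = len(estimate - truth) / (n_total - len(truth))
    return p_d, p_fa


def auc(scores, truth, cue):
    """Area under the ROC curve: the fraction of (subgraph, background)
    vertex pairs ranked correctly, with ties counted as one half."""
    pos = [z for v, z in scores.items() if v in truth and v != cue]
    neg = [z for v, z in scores.items() if v not in truth]
    wins = sum((zp > zn) + 0.5 * (zp == zn) for zp in pos for zn in neg)
    return wins / (len(pos) * len(neg))
```

Sweeping `t` over the sorted score values in `detection_rates` traces out the ROC curve whose area `auc` returns.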

III.  Uncued  Subgraph  Detection  Bounds 

We  will  specifically  consider  recent  spectral  bounds  for 
planted clique detection [4]. This work proposed a simple
algorithm for planted clique detection by thresholding the
principal eigenvector of the modularity matrix $B$, and showed
that  there  is  a  sharp  detectability  threshold  that  can  be  derived 
via  a  random  matrix  theoretic  analysis  of  the  problem.  The 
algorithm  is  as  follows.  Compute  the  principal  eigenvector 
u  of  the  modularity  matrix  B,  computed  with  respect  to  an 
Erdős-Rényi random graph. The estimated subgraph is then
computed as

$$\hat{V}_S = \left\{ v_i : |\sqrt{N} u_i| > F^{-1}_{\mathcal{N}(0,1)}\left(1 - \frac{\alpha}{2}\right) \right\}, \qquad (3)$$

where $F^{-1}_{\mathcal{N}(0,1)}$ is the inverse cumulative distribution function of a
standard normal distribution and $\alpha$ is the desired false alarm
rate. This algorithm is based on the fact that the modularity
matrix of a planted clique in an Erdős-Rényi graph is well
approximated by a rank-1 perturbation of a Wigner matrix
(a symmetric random matrix where all entries have zero mean
and equal variance), which has a known eigenvalue distribution
that enables an analytical detection bound. The entries in
the eigenvectors of a Wigner matrix also appear normally
distributed as $N \to \infty$. These observations yielded the
following  theorem. 

Theorem 3.1 (Nadakuditi [4]): Consider a $k$-vertex clique
planted in an $N$-vertex graph with edge probability $p$, where
the clique vertices are identified using (3) for a significance
level $\alpha$. Then, for fixed $p$, as $k, N \to \infty$ such that
$k/\sqrt{N} \to \beta \in (0, \infty)$ we have

$$P(\text{clique discovered}) \to \begin{cases} 1 & \text{if } \beta > \beta_{\text{crit}} := \sqrt{\dfrac{p}{1-p}}, \\ \alpha & \text{otherwise.} \end{cases} \qquad (4)$$

The detection threshold is based on the relationship between
the nonzero eigenvalue of the rank-1 perturbation, $\theta = k(1 - p)$,
and the maximum eigenvalue of the random background,
$R = \sqrt{4Np(1 - p)}$. This can easily be extended to cases
where a dense subgraph is embedded rather than a clique. If
the subgraph has a probability of internal connection $p_{in} > p$,
it will be detectable if

$$k(p_{in} - p) > \sqrt{Np(1 - p)} \qquad (5)$$

as $N \to \infty$. Note that, if $p \to 0$ as $N \to \infty$, then the right
hand side of (5) will approach the square root of the average
degree. The left hand side will approach the average internal
degree of the subgraph if $p = o(p_{in})$, and will approach a
constant multiple of this quantity if $p = \Theta(p_{in})$.
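
The thresholding rule (3) and the detectability condition (5) can be sketched as follows. This is a minimal illustration, not the implementation used in [4]: it substitutes shifted power iteration for a full eigensolver, applies $B = A - p(\mathbf{1}\mathbf{1}^T - I)$ implicitly through adjacency lists, and uses hypothetical function names.

```python
import math
import random
from statistics import NormalDist


def sample_graph(n, p, clique, rng):
    """Adjacency lists of G(n, p) with a clique planted on `clique`."""
    members = set(clique)
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if (i in members and j in members) or rng.random() < p:
                adj[i].append(j)
                adj[j].append(i)
    return adj


def principal_eigenvector(adj, p, iters=100):
    """Power iteration on B + R*I, where B = A - p(11^T - I).

    The shift R = sqrt(4 N p (1 - p)) approximates the spectral radius of
    the background, so the iteration targets the largest algebraic
    eigenvalue. More iterations are needed near the detection threshold.
    """
    n = len(adj)
    shift = math.sqrt(4.0 * n * p * (1.0 - p))
    x = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        total = sum(x)
        y = [sum(x[j] for j in adj[i]) - p * (total - x[i]) + shift * x[i]
             for i in range(n)]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    return x


def detect(u, alpha):
    """Rule (3): keep vertices with |sqrt(N) u_i| above the normal quantile."""
    n = len(u)
    threshold = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return {i for i, ui in enumerate(u) if abs(math.sqrt(n) * ui) > threshold}


def detectable(n, k, p, p_in=1.0):
    """Condition (5); p_in = 1 corresponds to a planted clique."""
    return k * (p_in - p) > math.sqrt(n * p * (1.0 - p))
```

With $N = 400$, $p = 0.1$ and $k = 40$, condition (5) holds by a wide margin ($36 > 6$), so `detect(principal_eigenvector(adj, 0.1), 0.01)` on a graph from `sample_graph` should recover the planted vertices, while a 5-vertex clique in the same background falls below the threshold.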




IV.  Cued  Subgraph  Detection  Bounds 

A.  Cued  Planted  Clique  Detection  Setting 

We  will  start  by  considering  the  planted  clique  problem 
when  one  of  the  clique  vertices  is  revealed.  Since  all  possible 
edges  exist  between  clique  vertices,  we  know  that  all  clique 
vertices are in the one-hop neighborhood of the cue vertex. We
denote by $N_i(v)$ the $i$-hop neighborhood of $v$, i.e., the vertices
that can be reached from $v$ via a path of length $i$ or less. This
allows a simple filtering procedure to incorporate the cue: We
can consider only $N_1(v_c)$ rather than all of $V$. Since the edges
in the background are all independent, when considering the
induced subgraph of $N_1(v_c) \setminus \{v_c\}$ (i.e., the graph consisting of
all edges in $E$ that occur between the vertices in the subset),
the objective is to solve another planted clique problem. In
this case, the clique has $k - 1$ vertices, and the size of the
background follows

$$|N_1(v_c) \setminus \{v_c\}| = k - 1 + d_c, \qquad (6)$$

where $d_c$ is drawn from the binomial distribution $B(N - k, p)$.
We can use this fact to derive bounds for the cued case when
applying the simple spectral algorithm to the cue's one-hop
neighborhood. 
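
The filtering step can be sketched in a few lines. The example assumes the graph is stored as adjacency lists indexed by vertex; the function name and the relabeling scheme are illustrative choices rather than anything specified in the text.

```python
def one_hop_subgraph(adj, cue):
    """Induced subgraph on the cue's one-hop neighborhood, cue excluded.

    Returns (kept, sub): `kept` lists the original vertex labels in sorted
    order, and `sub` holds adjacency lists relabeled to 0..len(kept)-1.
    """
    kept = sorted(set(adj[cue]) - {cue})
    index = {v: i for i, v in enumerate(kept)}
    # Keep only edges whose endpoints both survive the filter.
    sub = [[index[w] for w in adj[v] if w in index] for v in kept]
    return kept, sub
```

Any uncued detector can then be run on `sub`, with `kept` mapping its output back to the original vertex labels.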

B.  Bound  Derivation 

There are a few interesting cases for planted clique detection,
which consider different growth rates for the background
probability. First, consider the case where the background
probability remains constant as the graph grows. In this
scenario, the average degree of the graph grows linearly with
$N$. In this case, the distribution $B(N - k, p)$ will approach a
normal distribution $\mathcal{N}((N - k)p, (N - k)p(1 - p))$. We want
to determine when the $(k - 1)$-vertex clique in the cue's neighborhood
will be discovered with high probability, meaning that

$$(k - 1)(1 - p) > \sqrt{(k - 1 + d_c)p(1 - p)}. \qquad (7)$$

Assuming $k = o(N)$, the $d_c$ term will dominate. As $N$ grows,
$(N - k)p + C\sqrt{N - k}$ for a constant $C$ will be a fixed number
of standard deviations from the mean of $d_c$, meaning that $d_c$
will take on values greater than this with fixed probability.
Thus, by considering a threshold $(N - k)p + C(N - k)^{0.5 + \epsilon}$,
with the exponent satisfying $0.5 < 0.5 + \epsilon < 1$, we capture a
polynomially increasing number of standard deviations, which will
result in an exponential reduction in the probability of $d_c$
crossing the threshold as
$N$ increases. The asymptotic bound, therefore, is

$$k > p\sqrt{\frac{N}{1 - p}},$$

which follows from substituting $d_c \approx Np$ into (7).
The threshold value for the clique size still scales as the square
root of the number of total vertices, but it can be a constant
factor ($\sqrt{p}$) smaller than in the uncued case.

In  practice,  graphs  typically  do  not  increase  their  average 
degree  linearly  as  the  number  of  vertices  increases.  Studies 
have  shown  that  the  average  degree  often  follows  a  sublinear 
polynomial [6]. Thus, it is also important to consider cases
where the average degree $d_{avg}$ is $O(N^{\delta})$, $0 < \delta < 1$. In
this scenario, $p = O(N^{\delta - 1})$, so the $(1 - p)$ terms in (7) will
approach 1. By a similar argument to the constant $p$ case,
assuming the neighborhood size is a sublinearly increasing
number of standard deviations above the mean, we asymptotically
approach a detection threshold of

$$k > \sqrt{C N^{2\delta - 1}} \qquad (8)$$

for  a  constant  C.  In  this  case,  it  is  possible  that  the  detectable 
clique  size  can  actually  get  smaller  as  the  graph  grows,  since 
the  one-hop  neighborhood,  although  it  grows  slowly,  is  sparser. 
Using a variable $\delta$ helps demonstrate behavior for various
growth patterns: If the density is maintained ($\delta \to 1$), the
minimum detectable clique size grows as the square root of
the size of the graph, whereas if the average degree grows
very slowly ($\delta \to 0$), the size of the clique can decrease at
a rate close to $1/\sqrt{N}$ and be detected by the cued method.
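
The scaling behavior in this subsection can be checked numerically. The sketch below rests on stated assumptions: the uncued threshold comes from (5) with $p_{in} = 1$, the cued threshold replaces $d_c$ in (7) by its approximate mean $Np$ and drops the $k - 1$ term next to it, and the sparse case evaluates the right-hand side of (8). The function names are hypothetical.

```python
import math


def uncued_min_clique(n, p):
    """Clique size at which k(1 - p) equals sqrt(N p (1 - p))."""
    return math.sqrt(n * p / (1.0 - p))


def cued_min_clique(n, p):
    """Clique size at which (7) holds with equality, assuming d_c ~ N p
    and k = o(N), so that k - 1 + d_c is approximated by N p."""
    return 1.0 + p * math.sqrt(n / (1.0 - p))


def sparse_cued_threshold(n, delta, c=1.0):
    """Right-hand side of (8), sqrt(C N^(2 delta - 1)), for a constant C."""
    return math.sqrt(c * n ** (2.0 * delta - 1.0))
```

For fixed $p$ the ratio of the cued to the uncued threshold approaches $\sqrt{p}$, and `sparse_cued_threshold` decreases in $N$ whenever $\delta < 0.5$, which is the regime where the detectable clique can shrink as the graph grows.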

C.  Extension  to  Dense  Subgraphs 

Considering  dense  subgraphs  rather  than  cliques,  it  may 
not  be  the  case  that  the  entire  subgraph  is  in  the  one-hop 
neighborhood.  One  interesting  question  in  this  case  is  when 
multiple  hops  improve  detectability.  For  the  sake  of  simplicity, 
consider the expected value of the neighborhood size, $\mathbb{E}[d_c] =
kp_{in} + (N - k)p$, which, for large $N$, will be approximately
$Np$. The number of additional background nodes added in the
second hop is approximated by $(N - k - Np)(1 - (1 - p)^{Np})$.
For small $p$ and large $N$, this is asymptotically quadratic in
$p$ and $N$, behaving like $O(N^2 p^2)$, i.e., the average degree
squared. For large $p_{in}$, most of the dense subgraph will be
captured in the first hop, and the additional vertices will
hurt performance. If, on the other hand, the subgraph edge
probability is relatively small, then multiple hops will similarly
expand the number of subgraph vertices available for the
spectral algorithm to detect. The number of subgraph vertices
gained from neighbors within the subgraph is $O(k^2 p_{in}^2)$, and
the number gained from external neighbors is $O(Nkp^2)$. The
planted clique will be detectable in the two-hop neighborhood
if either $k^2 p_{in}^3$ or $Nkp^2 p_{in}$ grows faster than $Np^{3/2}$.
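
The neighborhood-size approximations in this subsection are easy to evaluate directly. The following sketch uses hypothetical function names and simply restates the two expressions from the text, so that the quadratic $N^2 p^2$ behavior for small $p$ can be checked numerically.

```python
def first_hop_size(n, k, p, p_in):
    """Expected one-hop neighborhood size, k p_in + (N - k) p."""
    return k * p_in + (n - k) * p


def second_hop_background(n, k, p):
    """Approximate count of background vertices added by the second hop,
    (N - k - N p)(1 - (1 - p)^(N p))."""
    return (n - k - n * p) * (1.0 - (1.0 - p) ** (n * p))
```

For $N = 10^6$, $k = 20$ and $p = 10^{-4}$, `second_hop_background` returns a value within about one percent of $N^2 p^2 = 10^4$, consistent with the average-degree-squared approximation.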

V.  Experiments 

In the previous section we theoretically analyzed the performance
gain achieved by providing a cue vertex to the
spectral method. In recognition of the importance of localized
community detection approaches, there has been a proliferation
of techniques that follow different perspectives of
incorporating partial knowledge in the solution. We consider
a few representative techniques from this literature: two local
spectral algorithms (MOVCUT and Quadratic Programming)
and two local random walk algorithms (Approximate Personalized
PageRank and Threat Propagation). We compare their
empirical performance to the cue-based spectral method in
both the clique and subgraph detection settings, for various
degrees of problem difficulty. We first give a brief description
of each algorithm and show in Section VI that the different
notions of locality they utilize lead to different empirical
performance for hard-to-detect cases.


A.  MOVCUT 

The MOVCUT algorithm [7] extends the traditional spectral
clustering formulation by adding a constraint that only
considers solution vectors $x$ that correlate well with the seed
vector $s$. Given a correlation parameter $\kappa$, the local spectral
optimization problem is written as follows:

$$\begin{aligned} \min_x \quad & x^T \mathcal{L} x \\ \text{subject to} \quad & x^T x = 1, \\ & x^T D^{1/2} \mathbf{1} = 0, \\ & (x^T D^{1/2} s)^2 \geq \kappa. \end{aligned}$$

The  solution  vector  is  expressed  by: 

$$x^* = c(L - \gamma D)^+ D s,$$

where $c \in [0, \infty]$ is a normalization constant to make the
solution $x^*$ a unit normed vector, and $\gamma \in (-\infty, \lambda_2(G))$
ensures that $x^*$ is found exactly on the boundary of the feasible
region. [7] showed that sweeping through the locally biased
solution $x^*$ has analogous theoretical guarantees to the traditional
spectral clustering solution. The MOVCUT algorithm
combines both global and local aspects of graph structure. The
$x^*$ vector is still a solution to a global optimization problem,
yet the restriction that it correlates to the seed by at least $\kappa$
ensures that the volume of the cut is no bigger than $\kappa$, therefore
localizing the output.

B.  Quadratic  Programming 

The quadratic programming algorithm is another local spectral
algorithm that we consider. In contrast to MOVCUT, it
uses the modularity matrix, which emphasizes the fact that
we  would  like  to  identify  subgraphs  with  unexpected  density 
(relative  to  some  null  distribution).  In  addition,  the  objective  of 
this  algorithm  directly  incorporates  the  knowledge  of  the  seed 
in  the  solution  vector.  Formally,  the  quadratic  programming 
algorithm  optimizes  the  following: 

$$\begin{aligned} \min_x \quad & x^T (\rho I - B) x \\ \text{subject to} \quad & x_i \leq 0, \; i \neq s, \\ & x_s \leq -1, \end{aligned}$$

where $\rho$ is the 2-norm of the modularity matrix $B$.

C.  Approximate  Personalized  PageRank  (APP) 

Within  the  class  of  random  walk  partitioning  algorithms,  the 
personalized  PageRank  algorithm  [8]  has  been  used  to  rank  the 
importance  of  vertices  relative  to  an  input  seed  vertex  s.  The 
solution  to  the  personalized  PageRank  problem  is  expressed 
as  follows: 

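$$r = \alpha s + (1 - \alpha) D^{-1} A r,$$

where $\alpha$ is the teleportation probability to $s$. This solution can
be re-written in the form:

$$r = \frac{\alpha}{1 - \alpha} \left( L + \frac{\alpha}{1 - \alpha} D \right)^{-1} D s$$

to emphasize the connection between the spectral and random
walk solutions with $\gamma = -\frac{\alpha}{1 - \alpha}$.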

Andersen et al. [9] developed an algorithm that approximates
the personalized PageRank solution by iteratively distributing
probabilities (vertex ranking scores) in a way that
favors the region near the seed vertex. The PageRank solution
$r$ is expressed as an approximate vector $\hat{r}$ plus a residual vector
$e$: $r = \hat{r} + e$. The initial residual vector is the indicator vector
for seed $s$. Given $s$, the algorithm moves an $\alpha$ fraction of
the probability from $e_s$ to $\hat{r}_s$. It then distributes the remaining
$1 - \alpha$ probability, half to itself and half to its neighbors in
magnitude proportional to their degree. The algorithm repeats
until a large portion of the probability has been pushed back to
the approximate solution vector $\hat{r}$. [9] showed that sweeping
through their local approximation PageRank vector offers
guarantees similar to the known Cheeger inequality. Note that the
APP  algorithm  is  a  true  local  algorithm  in  that  it  only  uses 
local  knowledge  of  the  neighborhood  around  a  vertex  to  update 
PageRank  scores. 
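
A push procedure of this kind can be sketched as below. The sketch follows the commonly stated push rule for approximate personalized PageRank, in which the half of the remaining residual sent to neighbors is split equally among them; this rule, the tolerance `eps`, and the function name are assumptions of this example and may differ in detail from the variant evaluated here.

```python
from collections import deque


def approximate_pagerank(adj, seed, alpha=0.15, eps=1e-6):
    """Return (p, r): approximate PageRank scores and leftover residuals.

    A vertex u is pushed while r[u] >= eps * deg(u). Each push moves an
    alpha fraction of r[u] into p[u], keeps half of the remainder at u,
    and spreads the other half equally over the neighbors of u.
    """
    n = len(adj)
    p = [0.0] * n
    r = [0.0] * n
    r[seed] = 1.0
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        deg = len(adj[u])
        if deg == 0:
            # Isolated vertex: no neighbors to push to, so settle its mass.
            p[u] += r[u]
            r[u] = 0.0
            continue
        if r[u] < eps * deg:
            continue
        mass = r[u]
        p[u] += alpha * mass
        r[u] = (1.0 - alpha) * mass / 2.0
        share = (1.0 - alpha) * mass / (2.0 * deg)
        for w in adj[u]:
            r[w] += share
            if r[w] >= eps * len(adj[w]):
                queue.append(w)
        if r[u] >= eps * deg:
            queue.append(u)
    return p, r
```

Only vertices that receive residual mass are ever touched, which is what makes the procedure local; the returned `p` (often normalized by degree before sweeping) can serve as the test statistic $z(v)$.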

D.  Threat  Propagation 

The threat propagation algorithm [10] is similar to the class
of personalized PageRank algorithms, but has the following
distinguishing features. It views the graph partitioning problem
as a $2^N$ multiple hypothesis test problem, where membership
(to  the  cut  set)  or  non-membership  needs  to  be  determined 
for  all  the  vertices.  It  maximizes  the  Bayesian  probability  of 
detection  by  computing  the  harmonic  solution  to  Laplace’s 
equation,  but  treats  this  as  a  boundary  value  problem  with 
seeds  representing  the  boundary  values  and  unknown  values 
representing the interior. Also, instead of considering a constant
diffusion probability $1 - \alpha$, it considers non-uniform
diffusion  probabilities  inversely  proportional  to  the  average 
path  lengths  between  the  seed  vertex  and  other  vertices.  This 
modification  biases  diffusion  towards  regions  of  the  graph  that 
are  tightly  connected  to  the  seed  vertex,  therefore  implicitly 
leading  to  localized,  sparse  solutions  around  the  seed.  The 
algorithm  is  proved  to  be  optimum  in  the  Neyman-Pearson 
sense  of  maximizing  the  probability  of  detection  at  a  fixed 
false  alarm  probability  [10]. 

VI.  Empirical  Results 

Results,  in  the  form  of  an  ROC  curve  for  a  single  parameter 
setting,  are  given  in  Figure  1.  This  figure  shows  the  average 
ROC  curves  of  each  methodology  over  100  graphs  with 
$|V| = 1000$ and $p = 0.2$. The mean and standard deviations
of  the  area  under  the  ROC  curves  across  all  parameter  settings 
are  given  in  Table  I.  Best  performing  methodologies  in  a 
specific  parameter  setting  are  highlighted  in  bold.  This  table 
outlines  the  algorithmic  performance  for  the  20  vertex  planted 
clique and planted dense subgraph experiments. In these
results, we observe good detection performance on the sparser
background networks ($p = 0.1$ to $p = 0.2$). Generally, on these
relatively simple problems, variation in detection performance
as  measured  by  mean  AUC  is  low. 


For  the  planted  dense  subgraph  experiment,  we  observe  that 
as  networks  grow  more  dense,  the  most  effective  methodology 
shifts  from  the  quadratic  programming  based  algorithm  to 
Approximate  PageRank.  In  the  planted  clique  experiments,  we 
observe  the  same  two  algorithms  generally  outperforming  the 
rest. 


Fig.  1.  An  ROC  curve  showing  detection  performance  of  vertices  in  the 
neighborhood of a cue vertex embedded in a $p_{in} = 0.8$ subgraph. The background
network generated is an Erdős-Rényi network with an edge probability
of  0.2.  Line  colors  represent  the  performance  of  different  methodologies. 

VII.  Summary 

In this paper, we extend recent spectral bounds for planted
clique  detection  to  cases  where  one  of  the  clique  vertices  is 
revealed.  We  show  that  this  reduces  to  the  problem  of  finding 
a  smaller  clique  within  a  smaller  random  background,  and 
that  the  same  random  matrix  theory  analysis  holds  after  an 
initial  filtering  of  the  vertices.  The  resulting  bounds  show  that 
a  clique  can  be  detected  that  grows  more  slowly  than  required 
in  the  uncued  case  by  a  factor  of  the  edge  probability,  which 
implies  that,  when  the  average  degree  grows  very  slowly, 
smaller  cliques  can  be  detected  as  the  total  number  of  vertices 
increases. Considering four cued subgraph detection methods
from the open literature, we show that a phase transition
occurs for these methods just as it occurs for the simple
method of applying a spectral detection method to the one-hop
neighborhood of the cue vertex.

From  this  point,  a  number  of  future  directions  are  possible 
for  this  research.  Understanding  the  limits  of  cued  detection 
in  graphs  with  community  structure  and  arbitrary  degree 
distributions  is  one  important  area.  This  will  be  complicated 
by  the  dependence  on  where  the  subgraph  is  placed  (on  high- 
or  low-degree  vertices,  within  a  single  background  community 
or  across  several,  etc.).  Another  interesting  result  would  be  to 
consider  methods  for  proving  cued  detectability  not  relying 
on  the  same  random  matrix  theoretic  analysis  as  used  here. 
It  is  possible  that,  for  flow-based  algorithms,  other  analytic 
techniques may be more appropriate. Even considering random
matrix theory techniques, it would be ideal to compute a bound
directly for the cued case, rather than using the uncued bound
on a filtered subset.
TABLE I

Table of AUC Means (Standard Deviations) for 20 Vertex Planted Subgraph

Planted Dense Subgraph, p_in = 0.8

Method       p = 0.1          p = 0.2          p = 0.3          p = 0.4          p = 0.5          p = 0.6
ApproxPR     0.972 (0.021)    0.863 (0.056)    0.785 (0.074)    0.733 (0.096)    0.718 (0.114)    0.692 (0.120)
MovCut       0.997 (0.009)    0.837 (0.086)    0.689 (0.085)    0.635 (0.097)    0.563 (0.084)    0.556 (0.071)
OneHop       0.900 (0.058)    0.887 (0.071)    0.663 (0.151)    0.549 (0.083)    0.513 (0.067)    0.511 (0.053)
QuadProg     1.000 (0.0001)   0.947 (0.046)    0.811 (0.073)    0.701 (0.077)    0.615 (0.088)    0.575 (0.095)
ThreatProp   0.835 (0.182)    0.722 (0.147)    0.646 (0.121)    0.618 (0.086)    0.566 (0.091)    0.538 (0.089)

Planted Clique

Method       p = 0.1          p = 0.2          p = 0.3          p = 0.4          p = 0.5          p = 0.6
ApproxPR     0.999 (0.001)    0.946 (0.024)    0.861 (0.053)    0.793 (0.084)    0.728 (0.117)    0.698 (0.107)
MovCut       1 (0)            0.992 (0.011)    0.874 (0.077)    0.750 (0.083)    0.678 (0.078)    0.617 (0.086)
OneHop       1 (0)            1 (0)            1.000 (0.0003)   0.901 (0.135)    0.558 (0.128)    0.519 (0.088)
QuadProg     1 (0)            1.000 (0.0005)   0.980 (0.016)    0.892 (0.053)    0.797 (0.075)    0.699 (0.077)
ThreatProp   0.921 (0.123)    0.883 (0.112)    0.760 (0.136)    0.705 (0.094)    0.656 (0.084)    0.607 (0.074)

